This report is intended to showcase and log all observations and insights from the projection of our Extended IMD datasets onto the IMD basis.
The Extended IMDbasis project is a follow-up of the IMD basis, which main goal was to develop a method through which we could understand the aetiological relationships among multiple Immune mediated diseases (IMD) using genome-wide association studies (GWAS) summary statistic data. Understanding the shared (and differential) pathogenic patterns across complex diseases may provide useful insights that can help improving therapeutical intervention. More details about the method can be found at our paper Informed dimension reduction of clinically-related genome-wide association summary data characterises cross-trait axes of genetic risk.
This project intends to expand the approach used in the IMD basis by:
In this report, we’ll focus on objective 2, by trying to extract insights and spot errors on already-processed files. All files were processed through a custom pipeline that included determining REF (reference, or non-effect allele) and ALT (effect allele) alleles, check that all relevant columns for projections were present, calculation of missing columns when possible, and liftover to hg38. After this, files were filtered by the IMD basis SNP manifest (a list of SNPs used to create the initial IMD basis, proved to be enough to capture most of the biological signal in the datasets), thus retaining only 556 SNPs, and flipping alleles when necessary (this was carried out using Projecting_dbs/Reducing_datasets_to_SNPmanifest.R). Finally, I projected all files and splitted the results in two tables: Projected_table_XXX.tsv and QC_table_XXX.tsv (where XXX stand for the date in which they were created). In addition, I created Extra_info_XXX.tsv, a modified version of Main_table containing useful information for analysis, such as N of samples, ancestry, chip, and extended trait name.
First, let’s load the required libraries and files.
library(data.table)
library(cupcake)
library(pheatmap)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
library(ggplot2)
library(ggrepel)
library(gridExtra)
proj.table <- fread("Projected_table_20200317.tsv")
QC.table <- fread("QC_table_20200317.tsv")
extra.info <- fread("Extra_info_20200317.tsv")
QC.table <- merge.data.table(QC.table, extra.info, all.x = TRUE)
Let’s take a first glance at our projected datasets by doing some exploratory stats. The first concern is the number of SNPs in the manifest that we found in the files too, as this will have direct consequences for latter analyses.
hist(QC.table$nSNP, main = "Number of matching SNPs (All files)", xlab = "N matching SNPs")
Most of the datasets have a high number of shared SNPs, which is good. There are a number of them that have 300-350 SNPs. I suspect that part of them are ImmunoChip studies, since Immunochip is a targeted array comprising less SNPs than a typical genome-wide array.
hist(QC.table[QC.table$Chip != "ImmunoChip", ]$nSNP, main = "Number of matching SNPs w/o Immunochip", xlab = "N matching SNPs", breaks = 12)
hist(QC.table[QC.table$Chip == "ImmunoChip", ]$nSNP, main = "Number of matching SNPs (Immunochip only)", xlab = "N matching SNPs")
We observe that, although ImmunoChip studies have lower number of matches, they’re not the only ones. Let’s take a look by type of trait
par(mfrow = c(2,2))
hist(QC.table[QC.table$Trait_class == "IMD", ]$nSNP, main = "IMD", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "BMK", ]$nSNP, main = "Biomarkers", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "INF", ]$nSNP, main = "Infection-related traits", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "PSD", ]$nSNP, main = "Psychiatric traits", xlab = "N matching SNPs")
More stats!
summary(QC.table)
## Trait nSNP overall_p mscomp
## Length:470 Min. : 3.0 Min. :0.000000 Length:470
## Class :character 1st Qu.:556.0 1st Qu.:0.000025 Class :character
## Mode :character Median :556.0 Median :0.200000 Mode :character
## Mean :531.4 Mean :0.369026
## 3rd Qu.:565.0 3rd Qu.:0.725000
## Max. :568.0 Max. :1.000000
## NA's :2
## File_ID Trait_long Trait_class N0
## Length:470 Length:470 Length:470 Min. : 15.0
## Class :character Class :character Class :character 1st Qu.: 940.2
## Mode :character Mode :character Mode :character Median : 20209.5
## Mean : 57392.0
## 3rd Qu.: 94292.2
## Max. :824006.0
##
## N1 N SNP_number Chip
## Min. : 0.0 Min. : 165 Min. : 10000 Length:470
## 1st Qu.: 0.0 1st Qu.: 1000 1st Qu.: 8958989 Class :character
## Median : 230.5 Median : 33167 Median :15581317 Mode :character
## Mean : 4934.5 Mean : 62339 Mean :13103860
## 3rd Qu.: 2599.5 3rd Qu.: 96377 3rd Qu.:16147927
## Max. :107769.0 Max. :898130 Max. :29504835
## NA's :49
## Population
## Length:470
## Class :character
## Mode :character
##
##
##
##
QC.table[is.na(QC.table$overall_p),]
We see that of all 468 studies, 2 of them failed to be projected due to low SNP count.
These two files are special. MDD_Wray is a Meta-analysis, containing data from multiple studies. This particular file contained 23andMe data, which due to privacy reasons, they only made public 10k SNPs, which explains the low match. Fortunately, we have another file of the same meta-analysis, including many more SNPs.
Regarding T2D_Gaulton, a MetaboChip was used for genotyping, which contains ~46k SNPs. Being a targeted array with low number of SNPs genotyped, it makes sense that only 11 matched.
For the projected ones, we see how many, and what proportion of all files have less than 95%, 80%, and 50% of matching SNPs, respectively
QC.table <- QC.table[!is.na(QC.table$overall_p),]
c(lessthan95 = nrow(QC.table[QC.table$nSNP < 556*.95,]), lessthan80 = nrow(QC.table[QC.table$nSNP < 556*.8,]), lessthan50 = nrow(QC.table[QC.table$nSNP < 556*.5,]))
## lessthan95 lessthan80 lessthan50
## 63 42 6
c(lessthan95 = nrow(QC.table[QC.table$nSNP < 556*.95,])/nrow(QC.table), lessthan80 = nrow(QC.table[QC.table$nSNP < max(QC.table$nSNP)*.8,])/nrow(QC.table), lessthan50 = nrow(QC.table[QC.table$nSNP < max(QC.table$nSNP)*.5,])/nrow(QC.table))
## lessthan95 lessthan80 lessthan50
## 0.13461538 0.08974359 0.01282051
For the following analyses, we’ll focus on files that came from a Genome-wide array (ie. not Immunochip), and have at least 80% of matching SNPs. This will guarantee a certain degree of confidence in our observations, as well as statistical consistency (Statistical terminology to be improved/made more accurate!).
FT80 <- QC.table[QC.table$nSNP/max(QC.table$nSNP) >= .8 & !grepl("ImmunoChip",QC.table$Chip, ignore.case = TRUE), ]
basis.traits <- c("Asthma", "^CD$", "^CEL$", "IgA_Neph", "^LADA$", "^MS$","^PBC$","^PSC$","^RA$", "^SLE$", "^T1D$","^UC$", "^VIT$") # So we keep basis traits too
FT80.proj.table <- proj.table[grepl(paste(c(FT80$Trait, basis.traits), collapse = "|"), proj.table$Trait),]
FT80.proj.table <- merge(FT80.proj.table, FT80[,c("Trait", "Trait_class")], by = "Trait", all.x = TRUE) # So we don't lose original basis traits
FT80.proj.table[is.na(FT80.proj.table$Trait_class), ]$Trait_class <- "IMD" # Since original basis traits are all IMD
Less80 <- QC.table[QC.table$nSNP/max(QC.table$nSNP) < .8, ]
Less80.proj.table <- proj.table[grepl(paste(Less80$Trait, collapse = "|"), proj.table$Trait),]
Since we projected many studies, the resulting PC P-values should be adjusted for family-wise error rates. For that we’ll compute the FDR by trait class (IMD, BMK, and INF), and by PC (PC1,…,PC13), and apply a FDR threshold of 1%. PCs with significant differences from control (\(\delta\)) in either direction will be marked with ’*’ in the following heatmaps.
FT80.proj.table[, FDR.PC.trait:=p.adjust(P, method = "BH"), by=c("Trait_class", "PC")]
FT80.proj.table[, stars:=ifelse(!is.na(FDR.PC.trait) & FDR.PC.trait<0.01,"*","")]
A good QC measure is to check whether SNPs in case-control studies have their SE proportional to sample size. In order to do this, we’ll first select case-control studies, and see what are the most popular SNPs.
QC.casecontrols <- FT80[FT80$N1 > 0,]
main_df <- data.frame()
for(i in QC.casecontrols$Trait){
x <- fread(paste("Filtered_datasets/",i,"-ft.tsv", sep = ""))
x$filename <- i
main_df <- rbind(main_df, x, fill= T)
}
filtered_df <- main_df[main_df$P < 1e-5, ]
mainSNPS <- as.data.table(table(filtered_df$SNPID))
manifest <- fread("Manifest_build_translator.tsv")
manifest[,pid19:=paste(CHR19, BP19, sep = ":")]
manifest <- manifest[,c(1,8)]
manifest <- merge(manifest, copy(cupcake::SNP.manifest), by.x = "pid19", by.y = "pid")
mainSNPS <- merge(mainSNPS, manifest[,c(1:4,6)], by.x = "V1", by.y="SNPID")
names(mainSNPS)[4:5] <- c("REF", "ALT")
setorder(mainSNPS, -N, ld.block)
head(mainSNPS, n = 50)
hist(mainSNPS$N, breaks = 15)
QC.N0N1.table <- merge(main_df, QC.table[,c(1, 8:10)], by.x = "filename", by.y = "Trait")
QC.N0N1.table$Z <- QC.N0N1.table$BETA/QC.N0N1.table$SE
QC.N0N1.table$N0N1Calc <- (QC.N0N1.table$N0 + QC.N0N1.table$N1)/(as.numeric(QC.N0N1.table$N0) * as.numeric(QC.N0N1.table$N1))
From this table we can select a number of SNPs to check in greater depth. For example:
rs2476601 is located in the PTPN22 gene and also known as R620W, C1858T, or 1858C>T, may influence risk for multiple autoimmune diseases, such as Rheumatoid Arthritis, Type-1 diabetes, Autoimmune thyroiditis, and Systemic lupus erythematosus.
rs2476601seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs2476601zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(yintercept = 4, linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs2476601seplot,rs2476601zplot, nrow=2)
We see a clear outlier - MDD_Howard_29662059_1. This dataset has an outstandingly low SE for what we would expect by its sample size. This GWAS was made on UKBB samples. Other outliers in this and other SNPs include UTI,
We also see in this and subsequent plots that Tian_28928442 datasets also have less SE than expected. These GWASs used 23 and Me data.
Regarding the Vulcano plot, we see what we would expect: A allele (left) confers increased risk for several IMDs, such as RA, T1D and VIT, whereas G allele (right) confers increased risk for Crohn’s disease, as well as elevated levels of certain biomarkers (White blood cells, granulucytes, Neutrophils, Eosinophils, etc.), not considered in this case-control analysis.
rs8067378 is associated with childhood asthma.
rs8067378seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs8067378zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], abs(Z) > 5), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept =5), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs8067378seplot,rs8067378zplot, nrow=2)
rs7130588 is a regulatory region variant, associated with asthma and atopic dermatitis.
rs7130588seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], abs(BETA) > 2.5 | log(SE) < -5 | log(SE) > -1))+
theme_classic()
rs7130588zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs7130588seplot,rs7130588zplot, nrow=2)
rs11065987 is an intergenic variant, associated with platelet count, total cholesterol, and others. Ensembl, SNPedia.
rs11065987seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs11065987zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs11065987seplot,rs11065987zplot, nrow=2)
rs12924729 is an intron variant, associated with risk to PBC (G allele). Ensembl.
rs12924729seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs12924729zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], abs(Z) > 4.5), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4.5), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs12924729seplot,rs12924729zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
rs1801274 is a missense variant in the Fc fragment of IgG, low affinity IIa, receptor (CD32) FCGR2A gene. rs1801274(C) encodes the arginine (R) allele, with the (T) allele encoding the variant histidine (H). FcgR isoforms expressed on immune system cells have been linked to the pathogenic consequences triggered by autoantibodies or immune complexes in autoimmune diseases such as rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE), as well as to the efficacy of some immunotherapeutic treatments such as rituximab.
rs1801274seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs1801274zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs1801274seplot,rs1801274zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
rs7574865, a SNP in the third intron of the STAT4 gene, has been reported in a large study of Swedes to be associated with both rheumatoid arthritis (RA) and lupus (SLE). Among other studies, it has been confirmed in a meta-analysis of 8 studies totaling 7,381 patients and over 10,000 controls from both European and Asian populations. The risk allele (oriented to the dbSNP entry) is (T); the odds ratio associated the presence of a risk allele was 1.3 for rheumatoid arthritis and 1.55 for lupus (SLE). SNPedia
rs7574865seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs7574865zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs7574865seplot,rs7574865zplot, nrow=2)
rs1050152 is a SNP in the SLC22A4 gene known as L503F, it has been associated with an autoimmune disease, specifically Crohn’s disease, being T the risk allele. SNPedia.
rs1050152seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs1050152zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], abs(Z) > 4), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs1050152seplot,rs1050152zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
rs1893217 is a SNP in PTPN2, linked to Crohn’s disease and type-1 diabetes. SNPedia.
rs1893217seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], abs(BETA) > 2.5 | log(SE) < -5))+
theme_classic()
rs1893217zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], aes(x = BETA, y = abs(Z), label = filename)) +
geom_point()+
geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], abs(Z) > 2), size = 3)+
geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
geom_hline(aes(yintercept = 2), linetype = "dashed", colour = "red")+
theme_classic()
grid.arrange(rs1893217seplot,rs1893217zplot, nrow=2)
To begin with, we want to take a look at traits that are significantly different from \(\delta\) overall.
hmcol <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7", "#F7F7F7", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))(200)
PCorder <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13")
FT80overall <- FT80
FT80overall[, FDR.overall:=p.adjust(overall_p, method = "BH"), by = "Trait_class"]
FT80overall <- FT80overall[FT80overall$FDR.overall < 0.05, ]
FT80overall.proj.table <- FT80.proj.table[grepl(paste(FT80overall$Trait, collapse = "|^"), FT80.proj.table$Trait), ]
Moverall <- acast(FT80overall.proj.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
Moverallstars <- acast(FT80overall.proj.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Moverall <- Moverall[,PCorder]
Moverallstars <- Moverallstars[,PCorder]
Moverall_col <- data.frame(FT80overall[grepl(paste(row.names(Moverall), collapse = "|"), FT80overall$Trait),], stringsAsFactors = F)
row.names(Moverall_col) <- Moverall_col$Trait
Moverall_col <- Moverall_col["Trait_class"]
pheatmap(Moverall, cluster_cols = FALSE, display_numbers = Moverallstars, fontsize_number = 11, color = hmcol, annotation_row = Moverall_col)
*** Observations
CD/UC/IBD cluster: There is a clear IBD-related cluster at the top of the heatmap, as we observed previously (although not as clear-cut in IBD subtypes as before). PSC (Primary Sclerosing Cholangitis) groups in this cluster, as we previously saw. In addition, related traits from FinnGen like ulcerative enterocholitis (UEC) and ulcerative ileocolitis (UILC) group within this cluster, as we’d expect. IBD FinnGen and UC Brant show weaker signal than their IBD/UC counterparts, which may explain why they’re apart from them. In the case of Brant, that GWAS was performed in African Americans. PC1 and PC3 are strong in the whole cluster, pointing in the same direction, whereas PC10 seems to split appart CD and UC, leaving IBD somewhere in the middle.
For FinnGen, despite being Europeans, Finnish are a very particular population, shaped by low effective sample sizes and numerous bottlenecks across their populational history, which sets them apart from the rest of European populations. This may play an important role on the distribution of FinnGen studies in these and other projection heatmaps, and we’ll surely need to take it into account in the future.
NMO and SSC: Neuromyelitis Optica and IgG+ NMO group together with Systemic sclerosis, showing a very similar projection pattern across all 13 PCs. NMO was previously considered a subtype of MS, which is coherent with its proximity in the clustering, and with SSc (although SSc affects different tissues, like connective tissue in the skin and internal organs, but not nervous tissue, apparently). NMO-SSc relationship has been mencioned a few times in the literature, with at least two cases of patients diagnosed for IgG+ NMO and later rediagnosed as SSC positive (Franciotta et al., 2011; Deeb et al., 2019). It also groups with European RA (Okada 1 and 3) but not with the 3 datasets in Japanese RA (Okada2, Kubo, and Ishigaki). SSC is also a rheumatic disease, which may help explain these relationships. We observe further down a couple of RA FinnGen datasets that group with SJOS and Still disease (SDAO), this lack of grouping with other European RA datasets might be explained by what I stated above. Also, NMO groups with SLE and Myastenia gravis, with which it has been previously associated. However, other associated diseases such as Sjögren’s syndrome (SJOS) and sarcoidosis (SARC), look to be further apart. A review from 2015 on NMO overlap with other diseases: Freitas & Guimaraes, 2015.
Arteritis (GCA, TAR) cluster: Although not significant for any PC, we observe a small cluster containing the three datasets we have on arteritis (Giant cell arteritis (GCA), Giant cell arteritis with polymyalgia rheumatica (GCAPMR), and temporal arteritis (TAR)).
Coeliac disease: Coeliac disease is on its own, as we saw earlier. There are two separated CEL datasets, the FinnGen one (N1 = 800) showing very weak signal.
Asthma/Allergic rhinitis cluster: Asthma in many of its forms groups with other allergic disorders, such as Allergic rhinitis (ALR), Allergic pollinosis (ALPOL), allergic asthma (ASTAL), childhood allergy (ALLCO)…and nasal polyps (NASP). This cluster seem to be characterized by a moderately strong signal at PC13, which we previously saw was related to eosinophil levels. This cluster doesn’t group with Eosinophil-related datasets, which have a much weaker signal at PC13. This is probably due to the differences in the nature of the trait (quantitative v. qualitative).
TBC
Since 215 traits is a lot to take a look at the same time, we’ll start by looking at a specific set of traits. Specifically, those present in the original basis, so we can have an overview of how new studies behave when projected onto a basis made from the same traits, and also to spot possible errors and unexpected behaviour in files. Note that some of the files included here are the same that were used to build the basis, so we can expect some exact matches.
We’ll start by filtering our dataframe by the traits in the basis:
all.basis.traits <- c("Asthma", "AST","^CD(_|$)", "CEL", "IgA_Neph", "IGAN", "LADA", "^MS(_|$)","PBC","PSC","^RA(_|$)", "^SLE(_|$)", "T1D","^UC(_|$)", "VIT(_|$)")
basis.table <- FT80.proj.table[grepl(paste(all.basis.traits, collapse = "|"), FT80.proj.table$Trait),]
Mbasis <- acast(basis.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
Mstars <- acast(basis.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbasis <- Mbasis[,PCorder]
Mstars <- Mstars[,PCorder]
pheatmap(Mbasis, cluster_cols = FALSE, display_numbers = Mstars, fontsize_number = 12, color = hmcol)
We observe several things. In general, studies tend to group by their trait, with Crohn’s disease and Ulcerative colitis grouping close together (and with PSC, as we previously saw). However we see some interesting stuff:
Now we see that the basis works well for the same traits used to build it, let’s expand our view to all IMD diseases.
IMD <- FT80[FT80$Trait_class == "IMD",]$Trait
IMD <- paste(IMD, collapse = "|")
IMD.table <- FT80.proj.table[grepl(IMD, FT80.proj.table$Trait),]
MIMD <- acast(IMD.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MIMDstars <- acast(IMD.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MIMD <- MIMD[,PCorder]
MIMDstars <- MIMDstars[,PCorder]
pheatmap(MIMD, cluster_cols = FALSE, display_numbers = MIMDstars, fontsize_number = 10, color = hmcol)
As the basis was trained in European populations, populations with other backgrounds are expected to show differences in strength signal, possibly having an influence on clustering. We’ll take a look at datasets restircted to contain Europeans.
IMDeur <- FT80[FT80$Trait_class == "IMD" & grepl("*European*", FT80$Population),]$Trait
IMDeur.table <- FT80.proj.table[grep(paste(IMDeur, collapse = "|^"), FT80.proj.table$Trait),]
MIMDeur <- acast(IMDeur.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MIMDeurstars <- acast(IMDeur.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MIMDeur <- MIMDeur[,PCorder]
MIMDeurstars <- MIMDeurstars[,PCorder]
pheatmap(MIMDeur, cluster_cols = FALSE, display_numbers = MIMDeurstars, fontsize_number = 10, color = hmcol)
One type of study we are paying special attention are those of immune biomarker levels, by their role in defence against infections and in autoimmunity. We remove all H2P2 project datasets for now, since they deserve a special look on their own
bmks <- FT80[FT80$Trait_class == "BMK", ]$Trait
bmks <- bmks[grep("^HP", bmks, invert = TRUE)]
bmks.table <- FT80.proj.table[grepl(paste(bmks, collapse = "|"), FT80.proj.table$Trait),]
Mbmk <- acast(bmks.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
Mbmkstars <- acast(bmks.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbmk <- Mbmk[,PCorder]
Mbmkstars <- Mbmkstars[,PCorder]
pheatmap(Mbmk, cluster_cols = FALSE, display_numbers = Mbmkstars, fontsize_number = 10, color = hmcol)
Just like we did before, let’s take a look only at European datasets
bmkeurs <- FT80[FT80$Trait_class == "BMK" & grepl("European", FT80$Population), ]$Trait
bmkeurs <- bmkeurs[grep("^HP", bmkeurs, invert = TRUE)]
bmkeurs.table <- FT80.proj.table[grepl(paste(bmkeurs, collapse = "|"), FT80.proj.table$Trait),]
Mbmkeur <- acast(bmkeurs.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
Mbmkeurstars <- acast(bmkeurs.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbmkeur <- Mbmkeur[,PCorder]
Mbmkeurstars <- Mbmkeurstars[,PCorder]
pheatmap(Mbmkeur, cluster_cols = FALSE, display_numbers = Mbmkeurstars, fontsize_number = 10, color = hmcol)
We’ll now project IMDs together with biomarkers, to see how they cluster together.
MIMDbmkeur <- rbind(Mbmkeur, MIMDeur)
MIMDbmkeurstars <- rbind(Mbmkeurstars, MIMDeurstars)
MIMDbmkeur <- MIMDbmkeur[,PCorder]
MIMDbmkeurstars <- MIMDbmkeurstars[,PCorder]
MIMDbmkeur_col <- data.frame(FT80[grepl(paste(row.names(MIMDbmkeur), collapse = "|"), FT80$Trait),], stringsAsFactors = F)
row.names(MIMDbmkeur_col) <- MIMDbmkeur_col$Trait
MIMDbmkeur_col <- MIMDbmkeur_col["Trait_class"]
pheatmap(MIMDbmkeur, cluster_cols = FALSE, display_numbers = MIMDbmkeurstars, fontsize_number = 10, color = hmcol, annotation_row = MIMDbmkeur_col)
Most biomarkers cluster together, rather than with IMD traits. This may be due to a more subtle effect size for these biomarker trait, compared with those of IMDs, as we’ve previously seen. Some interesting things:
We also collected data on infection-related traits, to analyze the other side of the coin.
INF <- FT80[FT80$Trait_class == "INF",]$Trait
INF.table <- FT80.proj.table[grepl(paste(INF, collapse = "|"), FT80.proj.table$Trait),]
MINF <- acast(INF.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MINFstars <- acast(INF.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MINF <- MINF[,PCorder]
MINFstars <- MINFstars[,PCorder]
pheatmap(MINF, cluster_cols = FALSE, display_numbers = MINFstars, fontsize_number = 10, color = hmcol)
Although the basis was built using IMDs, we want to explore its capacity to identify the relationships between other types of diseases, such as psychiatric diseases.
PSD <- FT80[FT80$Trait_class == "PSD",]$Trait
PSD.table <- FT80.proj.table[grepl(paste(PSD, collapse = "|"), FT80.proj.table$Trait),]
MPSD <- acast(PSD.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MPSDstars <- acast(PSD.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MPSD <- MPSD[,PCorder]
MPSDstars <- MPSDstars[,PCorder]
pheatmap(MPSD, cluster_cols = FALSE, display_numbers = MPSDstars, fontsize_number = 10, color = hmcol)
Despite not having plenty of significant PCs, it is surprising the degree of clustering that we observe within disease (See AD, Alzheimer’s, except for Early-Onset, MDD [Major depression disorder], BPD [Bipolar disorder], or SCZ [Schizophrenia]) or among related diseases (eg, dementia in AD groups with AD). Of course, the clustering it’s not pergect, and we see several mismatches (BPD_Hou, which is in Europeans, as all the other BPD, and has a big N1 [7467]), as well as big discordances among some studies on the same trait (see Obsessive-compulsive disorder).
One of the most interesting group of datasets that we have is the one produced by the H2P2 Project (Wang et al., 2018). In this study, Wang and colleagues built a catalog of cellular genome-wide association studies (GWAS) comprising 79 infection-related phenotypes in response to 8 pathogens (non-typhi Salmonella, S.typhi, Chlamidia trichromatis, Staphylococcus aureus, Candida albicans, Mucor circinelloides, and Toxoplasma gondii) in 528 lymphoblastoid cell lines.
H2P2 <- FT80[grepl("HP[0-9]", FT80$Trait), ]
H2P2.table <- FT80.proj.table[grepl(paste(H2P2$Trait, collapse = "|"), FT80.proj.table$Trait),]
MH2P2 <- acast(H2P2.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
MH2P2stars <- acast(H2P2.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MH2P2 <- MH2P2[,PCorder]
MH2P2stars <- MH2P2stars[,PCorder]
pheatmap(MH2P2, cluster_cols = FALSE, display_numbers = MH2P2stars, fontsize_number = 10, color = hmcol)
We first projected these datasets using EMP_Beta, and we observed huge differences in signals, owing to differences in Beta scale across different datasets. I attempted to correct this by adjusting all datasets by the populational variance in each study, but as we see we still have problems, as no clear signal is distinguishable, and it looks very similar to where we started. This may be due to a number of things (eg. “over”-adjustment of certain files, as for some the factor was very small, leading to very big betas). We should carefully check!